Data Summaries and Functions
Data Summaries
We've looked at a few graphical techniques for exploring data, and now we're going
to turn to a numerical one. Consider the question "Which day of the week has the
highest average box office for hit movies released on that day?". As a first step
in answering that question, it would be helpful to look at the mean box office
receipts for each of the days. If you look for a function to do that specific
task, you probably wouldn't find one, because R takes the more general approach of
providing a function that will allow you to calculate anything you want from vectors
of values broken down by groups. In fact, there are a variety of ways to do this.
The one we're going to look at is called aggregate. You pass aggregate
a vector or data frame containing the variables you want to summarize, a list of
the groups to summarize by, and the function you'd like to use for your summaries.
That way, a single function can perform many tasks, and, as we'll see when we
learn to write functions, it even allows R to do things that the developers of
R never imagined. For now, we'll stick to some built-in functions, like
mean. To find the means for the box office receipts for each day of the
week, we could use a call to aggregate like this:
> aggregate(movies$box,movies['weekday'],mean)
weekday x
1 Monday 126.7766
2 Tuesday 104.8419
3 Wednesday 127.1272
4 Thursday 104.0686
5 Friday 102.6522
6 Saturday 82.2441
7 Sunday 103.0268
The same thing could be done to calculate other statistics, like median,
min, max, or any statistic that returns a single scalar value
for each group. Another nice feature of aggregate is that if the
first argument is a data frame, it will calculate the statistic for each column
of the data frame. If we passed aggregate both Rank and
box, we'd get two columns of summaries:
> aggregate(movies[,c('Rank','box')],movies['weekday'],mean)
weekday Rank box
1 Monday 443.1538 126.7766
2 Tuesday 511.7037 104.8419
3 Wednesday 455.0116 127.1272
4 Thursday 560.5122 104.0686
5 Friday 520.0766 102.6522
6 Saturday 596.1000 82.2441
7 Sunday 497.6667 103.0268
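Since the movies data frame isn't reproduced here, the following is a minimal sketch of the same two calls on a small made-up data frame (the name toy and its values are hypothetical), using median and max in place of mean:

```r
# Hypothetical toy data standing in for the movies data frame
toy <- data.frame(
  weekday = factor(c("Friday", "Friday", "Friday", "Wednesday", "Wednesday"),
                   levels = c("Wednesday", "Friday")),
  Rank    = c(10, 20, 30, 5, 15),
  box     = c(100, 200, 600, 50, 150)
)

# One vector, summarized by the median within each weekday;
# the summary column is named x, as in the output above
med_box <- aggregate(toy$box, toy["weekday"], median)

# Two columns at once, summarized by the maximum within each weekday
max_both <- aggregate(toy[, c("Rank", "box")], toy["weekday"], max)

print(med_box)
print(max_both)
```

Any function that returns a single value per group can be substituted for median or max in the third argument.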
To add a column of counts to the table, we can create a data frame from
the table function, and merge it with the aggregated results:
> dat = aggregate(movies[,c('Rank','box')],movies['weekday'],mean)
> cts = as.data.frame(table(movies$weekday))
> head(cts)
Var1 Freq
1 Monday 13
2 Tuesday 27
3 Wednesday 172
4 Thursday 41
5 Friday 744
6 Saturday 10
To make the merge simpler, we rename the first column of cts to
weekday.
> names(cts)[1] = 'weekday'
> res = merge(cts,dat)
> head(res)
weekday Freq Rank box
1 Friday 744 520.0766 102.6522
2 Monday 13 443.1538 126.7766
3 Saturday 10 596.1000 82.2441
4 Sunday 12 497.6667 103.0268
5 Thursday 41 560.5122 104.0686
6 Tuesday 27 511.7037 104.8419
Finally, we can put the rows back in day-of-week order as follows:
> res[order(res$weekday),]
weekday Freq Rank box
2 Monday 13 443.1538 126.7766
6 Tuesday 27 511.7037 104.8419
7 Wednesday 172 455.0116 127.1272
5 Thursday 41 560.5122 104.0686
1 Friday 744 520.0766 102.6522
3 Saturday 10 596.1000 82.2441
4 Sunday 12 497.6667 103.0268
Functions
As you've already noticed, functions play an important role in R. A very
attractive feature of R is that you can write your own functions which
work exactly the same as the ones that are part of the official R release.
In fact, if you create a function with the same name as one that's already
part of R, it will override the built-in function, and possibly cause
problems. For that reason, it's a good idea to make sure that there's not
already another function with the name you want to use. If you type the
name you're thinking of, and R responds with a message like
"object 'xyz' not found", you're probably safe.
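A programmatic alternative to typing the name is the exists function, which reports whether a name is currently defined. The sketch below uses a deliberately odd, hypothetical name for the "unused" case:

```r
# mean is already part of R, so a function of that name would mask it
taken <- exists("mean")

# A hypothetical name that should be free in a fresh session
free <- !exists("a_name_unlikely_to_exist_xyz123")

# Restrict the check to objects that are functions
taken_as_function <- exists("mean", mode = "function")

print(c(taken = taken, free = free, taken_as_function = taken_as_function))
```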
There are several reasons why creating your own functions is a good idea.
- If you find yourself writing the same code over and over again as you work on
different problems, you can write a function that incorporates whatever it is
you're doing and call the function, instead of rewriting the code over and over.
- All of the functions you create are saved in your workspace along with your data.
So if you put the bulk of your work into functions that you create, R will
automatically save them for you (if you tell R to save your workspace when you
quit).
- It's very easy to write "wrappers" around existing functions to make a custom
version that sets the arguments to another function to be just what you want.
R provides a special mechanism to "pass along" any extra arguments the other
function might need.
- You can pass your own functions to built-in R functions like aggregate,
by, apply, sapply, lapply, mapply,
sweep and other functions to efficiently and easily perform customized
tasks.
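As an illustration of the last point, here is a small sketch in which a user-written function is handed to aggregate and to sapply. The function spread and the data frame toy are hypothetical names invented for this example.

```r
# A user-defined summary: the spread (max minus min) of a numeric vector
spread <- function(x) max(x) - min(x)

toy <- data.frame(
  group = c("a", "a", "b", "b", "b"),
  value = c(1, 4, 10, 12, 19)
)

# Pass the function to aggregate just like a built-in such as mean
by_group <- aggregate(toy["value"], toy["group"], spread)

# The same function works with sapply over the columns of a data frame
per_column <- sapply(data.frame(u = c(1, 5), v = c(2, 10)), spread)

print(by_group)
print(per_column)
```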
Before getting down to the details of writing your own functions, it's a good idea
to understand how functions in R work. Every function in R has a set of arguments that
it accepts. You can see the arguments that built-in functions take in a number of ways:
viewing the help page, typing the name of the function in the interpreter, or using the
args function. When you call a function, you can simply pass it arguments,
in which case they must line up exactly with the way the function is designed, or
you can specifically pass particular arguments in whatever order you like by providing
them with names using the name=value syntax. You also can combine the two,
passing unnamed arguments (which have to match the function's definition exactly),
followed by named arguments in whatever order you like.
For example, consider the function read.table. We can view its argument list
with the command:
> args(read.table)
function (file, header = FALSE, sep = "", quote = "\"'", dec = ".",
row.names, col.names, as.is = !stringsAsFactors, na.strings = "NA",
colClasses = NA, nrows = -1, skip = 0, check.names = TRUE,
fill = !blank.lines.skip, strip.white = FALSE, blank.lines.skip = TRUE,
comment.char = "#", allowEscapes = FALSE, flush = FALSE,
stringsAsFactors = default.stringsAsFactors(), encoding = "unknown")
NULL
This argument list tells us that, if we pass unnamed arguments to
read.table, it will interpret the first as file, the next as
header, then sep, and so on. Thus if we wanted to read the
file my.data, with header set to TRUE and sep
set to ',', any of the following calls would be equivalent:
read.table('my.data',TRUE,',')
read.table(sep=',',TRUE,file='my.data')
read.table(file='my.data',sep=',',header=TRUE)
read.table('my.data',sep=',',header=TRUE)
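To check the equivalence directly, the sketch below writes a tiny comma-separated file to a temporary location (standing in for my.data; the contents are made up) and compares the results of the four calls:

```r
# Write a small comma-separated file to a temporary location
path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "ann,1", "bob,2"), path)

r1 <- read.table(path, TRUE, ",")
r2 <- read.table(sep = ",", TRUE, file = path)
r3 <- read.table(file = path, sep = ",", header = TRUE)
r4 <- read.table(path, sep = ",", header = TRUE)

# All four calls should produce the same data frame
all_same <- identical(r1, r2) && identical(r1, r3) && identical(r1, r4)
print(all_same)

unlink(path)  # remove the temporary file
```

In the second call, file and sep are matched by name first, so the single unnamed argument TRUE is assigned to the first remaining position, header.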
Notice that most of the arguments in the argument list for read.table
have default values after the name of the argument. The file argument has none,
which means it is required. (The row.names and col.names arguments also appear
without defaults, but read.table checks internally whether they were supplied,
so they can be omitted.) Any of the other arguments are optional, and if we
don't specify them the default values that appear in the argument list will be used.
Most R functions are written so that the
first few arguments are the ones most commonly used, so that their
values can be entered without providing names, with the other arguments being optional.
Optional arguments can be passed to a function by position, but are much more commonly
passed using the name=value syntax, as in the last example of calling
read.table.
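The same rules apply to functions you write yourself. As a rough sketch, the hypothetical function describe below has one required argument (x) and two optional ones with defaults:

```r
# Hypothetical function: x is required, digits and label have defaults
describe <- function(x, digits = 1, label = "mean") {
  paste0(label, ": ", round(mean(x), digits))
}

d1 <- describe(c(1, 2, 4))                     # both defaults used
d2 <- describe(c(1, 2, 4), 2)                  # digits passed by position
d3 <- describe(c(1, 2, 4), label = "average")  # skip digits, name label

# Leaving out the required argument is an error
d4 <- try(describe(), silent = TRUE)

print(c(d1, d2, d3))
print(inherits(d4, "try-error"))
```

Passing label by name avoids having to supply digits just to reach the third position.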
Now let's take a look at the function read.csv. You may recall that this
function simply calls read.table with a set of parameters that makes sense
for reading comma separated files. Here's read.csv's function definition,
produced by simply typing the function's name at the R prompt:
function (file, header = TRUE, sep = ",", quote = "\"", dec = ".",
fill = TRUE, comment.char = "", ...)
read.table(file = file, header = header, sep = sep, quote = quote,
dec = dec, fill = fill, comment.char = comment.char, ...)
<environment: namespace:utils>
Pay special attention to the three periods (...) in the argument list. Notice
that they also appear in the call to read.table inside the function's body.
The three dots mean all the arguments that were passed to the function that didn't
match any of the previous arguments in the argument list. So if you pass
anything other than file, header, sep, quote,
dec, fill, or comment.char to read.csv, it will be part of the three dots;
by putting the three dots at the end of the argument list in the call to
read.table, all those unmatched arguments are simply passed along to
read.table. So if you make a call to read.csv like this:
read.csv(filename,stringsAsFactors=FALSE)
the stringsAsFactors=FALSE will get passed to read.table, even though
it wasn't explicitly named in the argument list. Without the three dots, R will not
accept any arguments that aren't explicitly named in the argument list of the function
definition. If you want to intercept the extra arguments yourself, you can include
the three dots at the end of the argument list when you define your function, and
create a list of those arguments inside the function body by referring to list(...).
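A minimal sketch of both behaviors follows, using two hypothetical functions: show_extras captures its unmatched arguments with list(...), while no_dots has no three dots and so rejects anything it doesn't name.

```r
# Hypothetical function that reports which extra arguments it received
show_extras <- function(x, ...) {
  extras <- list(...)   # the unmatched arguments, as a list
  names(extras)
}

# Hypothetical function without the three dots
no_dots <- function(x) x

got      <- show_extras(1, sep = ",", na.strings = "?")
none     <- show_extras(1)
rejected <- try(no_dots(1, sep = ","), silent = TRUE)

print(got)
print(inherits(rejected, "try-error"))
```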
Suppose you want to create a function that will call read.csv with a filename,
but which will automatically set the stringsAsFactors=FALSE parameter. For maximum
flexibility, we'd want to be able to pass other arguments (like na.strings=,
or quote=) to read.csv, so we'll include the three dots at the end of
the argument list. We could name the function read.csv and overwrite the built-in
version, but that's not a good idea, if for no other reason than the confusion it would
cause if someone else tried to understand your programs! Suppose we call the function
myread.csv. We could write a function definition as follows:
> myread.csv = function(file,stringsAsFactors=FALSE,...){
+ read.csv(file,stringsAsFactors=stringsAsFactors,...)
+ }
Now, we could simply use
thedata = myread.csv(filename)
to read a comma-separated file with stringsAsFactors=FALSE. You could still pass
any of read.table's arguments to the function (including stringsAsFactors=TRUE if
you wanted), and, if you ask R to save your workspace when you quit, the function will be
available to you next time you start R in the same directory.
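To see the wrapper in action without a real data file, the sketch below writes a small made-up file to a temporary location and reads it three ways. The file contents and the variable names are assumptions for illustration only.

```r
myread.csv <- function(file, stringsAsFactors = FALSE, ...) {
  read.csv(file, stringsAsFactors = stringsAsFactors, ...)
}

# A small made-up file in which "?" marks a missing value
path <- tempfile(fileext = ".csv")
writeLines(c("name,score", "ann,1", "bob,?"), path)

plain  <- myread.csv(path)                          # strings stay character
extras <- myread.csv(path, na.strings = "?")        # passed along through ...
asfac  <- myread.csv(path, stringsAsFactors = TRUE) # override the default

print(class(plain$name))
print(class(asfac$name))
print(extras$score)

unlink(path)  # remove the temporary file
```

The na.strings argument is not named anywhere in myread.csv or read.csv; it travels through the three dots of both functions until it reaches read.table.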